Pointblank provides a built-in test data generation system that creates realistic, locale-aware synthetic data based on schema definitions. This is useful for testing validation rules, creating sample datasets, and generating fixture data for development.
Note
Throughout this guide, we use pb.preview() to display generated datasets with nice HTML formatting. This is optional: pb.generate_dataset() returns a standard DataFrame that you can display or manipulate however you prefer.
Quick Start
Generate test data using a schema with field constraints:
```python
import pointblank as pb

# Define a schema with typed field specifications
schema = pb.Schema(
    user_id=pb.int_field(min_val=1, unique=True),
    name=pb.string_field(preset="name"),
    email=pb.string_field(preset="email"),
    age=pb.int_field(min_val=18, max_val=80),
    status=pb.string_field(allowed=["active", "pending", "inactive"]),
)

# Generate 100 rows of test data (seed ensures reproducibility)
pb.preview(pb.generate_dataset(schema, n=100, seed=23))
```
```
Polars · Rows: 100 · Columns: 5

      user_id              name                email                         age    status
      Int64                String              String                        Int64  String
   1  7188536481533917197  Doris Martin        d_martin@aol.com              77     pending
   2  2674009078779859984  Nancy Gonzalez      nancygonzalez@icloud.com      67     active
   3  7652102777077138151  Jessica Turner      jturner@aol.com               78     active
   4  157503859921753049   George Evans        georgeevans@zoho.com          36     inactive
   5  2829213282471975080  Patricia Williams   pwilliams@outlook.com         75     pending
   …
  96  7027508096731143831  Isaiah Murphy       isaiah.murphy@zoho.com        55     active
  97  6055996548456656575  Brittany Rodriguez  brodriguez@yandex.com         39     inactive
  98  3822709996092631588  Megan Stevens       mstevens26@aol.com            24     inactive
  99  1522653102058131295  Pamela Jenkins      pjenkins29@yandex.com         41     active
 100  5690877051669225499  Stephanie Santos    stephanie.santos40@gmail.com  75     pending
```
Field Types
Pointblank provides helper functions for defining typed columns with constraints:
For numeric fields, values are uniformly distributed across the specified min/max range, making them useful for simulating measurements, prices, or any continuous numeric data.
String Fields with Presets
Presets generate realistic data like names, emails, and addresses. When you include related fields like name and email in the same schema, Pointblank ensures coherence (e.g., the email address will be derived from the person’s name), making the generated data more realistic:
This coherence extends to other related fields like user_name, which will also reflect the person’s name when included alongside name and email fields.
String Fields with Patterns
Use regex patterns to generate strings matching specific formats:
Patterns support standard regex character classes and quantifiers, giving you flexibility to generate data matching virtually any format specification.
Probabilistic control over categorical values is helpful when you need to simulate real-world distributions where certain states are more common than others.
Date and Datetime Fields
Temporal fields accept Python date and datetime objects for their range boundaries, generating values uniformly distributed within the specified period:
The same pattern applies to time_field() and duration_field(), allowing you to generate realistic temporal data for any use case.
Available Presets
The preset= parameter in string_field() supports many data types:
Personal Data:
name: full name (first + last)
name_full: full name with optional prefix/suffix (e.g., “Dr. Ana Sousa”, “Prof. Tanaka Yuki”)
first_name: first name only
last_name: last name only
email: email address
phone_number: phone number in country-specific format
Location Data:
address: full street address
city: city name
state: state/province name
country: country name
postcode: postal/ZIP code
latitude: latitude coordinate
longitude: longitude coordinate
Business Data:
company: company name
job: job title
catch_phrase: business catch phrase
Internet Data:
url: website URL
domain_name: domain name
ipv4: IPv4 address
ipv6: IPv6 address
user_name: username
password: password
Financial Data:
credit_card_number: credit card number
iban: International Bank Account Number
currency_code: currency code (USD, EUR, etc.)
Identifiers:
uuid4: UUID version 4
md5: MD5 hash (32 hex characters)
sha1: SHA-1 hash (40 hex characters)
sha256: SHA-256 hash (64 hex characters)
ssn: Social Security Number (country-specific format)
license_plate: vehicle license plate (location-aware for CA, US, DE, AU, GB)
Barcodes:
ean8: EAN-8 barcode with valid check digit
ean13: EAN-13 barcode with valid check digit
Date/Time:
date_this_year: a date within the current year
date_this_decade: a date within the current decade
date_between: a random date between 2000 and 2025
date_range: two dates joined with an en-dash (e.g., "2012-05-12 – 2015-11-22")
future_date: a date up to 1 year in the future
past_date: a date up to 10 years in the past
time: a time value
Text:
word: single word
sentence: full sentence
paragraph: paragraph of text
text: multiple paragraphs
Miscellaneous:
color_name: color name
file_name: file name
file_extension: file extension
mime_type: MIME type
user_agent: browser user agent string (country-weighted)
Country-Specific Data
One of the most powerful features is generating locale-aware data. Use the country= parameter to generate data specific to a country. This affects names, cities, addresses, and other locale-sensitive presets.
Let’s create a schema that includes several location-related fields. When generating data for a specific country, Pointblank ensures consistency across related fields. The city, address, postcode, and coordinates will all correspond to the same location:
Here’s German data with authentic names and addresses from cities like Berlin, Munich, and Hamburg. Notice how the latitude/longitude coordinates match real locations in Germany:
```
      name             city                   address                                            postcode  latitude   longitude
   …
   4  Juliane Münz     Leipzig                Lindenauer Markt 6249, Whg. 489, 04541 Leipzig     04992     51.276862  12.458890
   5  Anton Baumann    Köln                   Aachener Straße 7203, 50125 Köln                   50589     50.967264  6.795838
   …
 196  Franziska Wendt  Ulm                    Marktplatz 6251, Whg. 535, 89984 Ulm               89226     48.395296  10.001962
 197  Lennart Berger   München                Brienner Straße 1390, Whg. 389, 80255 München      80835     48.206882  11.674262
 198  Julia Knecht     Ludwigshafen am Rhein  Friedrichstraße 3204, 67944 Ludwigshafen am Rhein  67305     49.473668  8.437782
 199  Sebastian Thiel  Gelsenkirchen          Husemannstraße 453, Whg. 273, 45732 Gelsenkirchen  45992     51.568689  7.082531
 200  Trude Kaiser     Kassel                 Königstraße 1394, 34406 Kassel                     34736     51.326544  9.494319
```
Japanese data includes names in romanized form and addresses from cities like Tokyo, Osaka, and Kyoto. The coordinates fall within Japan’s geographic boundaries:
Brazilian data features Portuguese names and addresses from cities like São Paulo, Rio de Janeiro, and Brasília. The postal codes follow Brazil’s CEP format:
This location coherence is valuable when testing geospatial applications, address validation systems, or any scenario where realistic, internally-consistent location data matters.
Data Coherence
Pointblank automatically links related columns to produce realistic rows. There are three coherence systems that activate based on which presets appear together in a schema:
Address coherence activates when any address-related preset is present (address, city, state, postcode, latitude, longitude, phone_number, license_plate). All of these fields will refer to the same location within each row.
Person coherence activates when any person-related preset is present (name, name_full, first_name, last_name, email, user_name). The email and username are derived from the person’s name.
Business coherence activates when both job and company are present. When active:
the company and job title are drawn from the same industry (e.g., a nurse will work at a hospital, not a law firm).
name_full gains profession-matched titles: a doctor may appear as “Dr. Ana Sousa” and a professor as “Prof. Tanaka Yuki”. For German-speaking countries (DE, AT, CH), the honorific stacks before the professional title (e.g., “Herr Dr. med. Klaus Weber”).
integer columns whose name contains age (e.g., age, person_age) are automatically constrained to working-age range (22–65).
Here’s an example showing all three coherence systems working together:
License plate coherence is part of address coherence. For CA, US, DE, AU, and GB, license plates follow real subregion-specific formats when location fields are present. For example, an Ontario row produces plates like "CABC 123" while a British Columbia row produces "AB1 23C". Letters I, O, Q, and U are excluded from plate generation, matching real-world restrictions.
Supported Countries
Pointblank currently supports 71 countries with full locale data for realistic test data generation. You can use either ISO 3166-1 alpha-2 codes (e.g., "US") or alpha-3 codes (e.g., "USA").
Europe (32 countries):
Austria (AT), Belgium (BE), Bulgaria (BG), Croatia (HR), Cyprus (CY), Czech Republic (CZ), Denmark (DK), Estonia (EE), Finland (FI), France (FR), Germany (DE), Greece (GR), Hungary (HU), Iceland (IS), Ireland (IE), Italy (IT), Latvia (LV), Lithuania (LT), Luxembourg (LU), Malta (MT), Netherlands (NL), Norway (NO), Poland (PL), Portugal (PT), Romania (RO), Russia (RU), Slovakia (SK), Slovenia (SI), Spain (ES), Sweden (SE), Switzerland (CH), United Kingdom (GB)
Americas (9 countries):
Argentina (AR), Brazil (BR), Canada (CA), Chile (CL), Colombia (CO), Costa Rica (CR), Mexico (MX), Peru (PE), United States (US)
Asia-Pacific (17 countries):
Australia (AU), Bangladesh (BD), China (CN), Hong Kong (HK), India (IN), Indonesia (ID), Japan (JP), Malaysia (MY), New Zealand (NZ), Pakistan (PK), Philippines (PH), Singapore (SG), South Korea (KR), Sri Lanka (LK), Taiwan (TW), Thailand (TH), Vietnam (VN)
Middle East & Africa (13 countries):
Algeria (DZ), Egypt (EG), Ethiopia (ET), Ghana (GH), Kenya (KE), Morocco (MA), Nigeria (NG), Senegal (SN), South Africa (ZA), Tunisia (TN), Turkey (TR), Uganda (UG), United Arab Emirates (AE)
Additional countries and expanded coverage are planned for future releases.
Mixing Multiple Countries
When you need test data that spans multiple locales (e.g., simulating an international customer base), you can pass a list or dict to the country= parameter instead of a single string.
Passing a list of country codes splits rows equally across those countries. Here, 200 rows are divided evenly among the US, Germany, and Japan (~67 each):
To control the proportion of rows per country, pass a dict mapping country codes to weights. The following generates 200 rows with 70% from the US, 20% from Germany, and 10% from France:
Weights are auto-normalized, so {"US": 7, "DE": 2, "FR": 1} is equivalent to the example above. Row counts are allocated using largest-remainder apportionment, ensuring they always sum to exactly n.
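The largest-remainder rule itself is easy to sketch in plain Python; this illustrates the allocation arithmetic, not Pointblank's internal code:

```python
def apportion(weights: dict[str, float], n: int) -> dict[str, int]:
    """Allocate n rows to keys in proportion to weights, using the
    largest-remainder method so the counts always sum to exactly n."""
    total = sum(weights.values())
    quotas = {k: n * w / total for k, w in weights.items()}
    counts = {k: int(q) for k, q in quotas.items()}  # floor of each quota
    leftover = n - sum(counts.values())              # rows still unassigned
    # Hand the remaining rows to the keys with the largest fractional parts
    by_remainder = sorted(quotas, key=lambda k: quotas[k] - counts[k], reverse=True)
    for k in by_remainder[:leftover]:
        counts[k] += 1
    return counts

print(apportion({"US": 7, "DE": 2, "FR": 1}, 200))  # {'US': 140, 'DE': 40, 'FR': 20}
```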
By default, rows from different countries are interleaved randomly (shuffle=True). Set shuffle=False to keep rows grouped by country in the order the countries are listed:
All coherence systems (address, person, business) work correctly within each country’s batch of rows. A French row will have a French name with a matching French email; a Japanese row will have a Japanese name with a matching Japanese email. Non-preset columns (integers, floats, booleans, dates) are generated independently for each batch but still respect their field constraints.
Frequency-Weighted Sampling
By default, names and cities are sampled uniformly at random from the locale data, giving every entry the same probability of being selected. Real-world distributions are far from uniform though: “James” and “Maria” appear orders of magnitude more often than “Thaddeus” or “Xiomara”, and more people live in New York City than in Flagstaff. The weighted=True parameter makes generated data reflect this natural skew.
With weighting enabled you will see popular names like James, John, Mary, and Patricia appear more frequently, while unusual names surface only occasionally. Similarly, cities like New York, Los Angeles, and Chicago dominate the output while smaller cities appear less often.
The feature works by organizing locale data into four frequency tiers. Each tier has a sampling probability that determines how likely its members are to be selected:
```
Tier          Probability   Contents
very_common   45%           The top ~10% of entries by real-world frequency
common        30%           The next ~20% of entries
uncommon      20%           The next ~30% of entries
rare          5%            The remaining ~40% of entries
```
When a value is needed, a tier is first chosen according to these probabilities and then a single entry is picked uniformly at random within that tier. This two-step approach keeps sampling fast while producing a realistic long-tail distribution. Setting weighted=False pools all entries across every tier and samples them uniformly, which can be useful when you want an even spread rather than a realistic distribution.
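The two-step draw can be sketched in plain Python; this illustrates the idea rather than reproducing Pointblank's internal code:

```python
import random

# Tier probabilities from the table above
TIER_WEIGHTS = {"very_common": 0.45, "common": 0.30, "uncommon": 0.20, "rare": 0.05}

def weighted_pick(tiers: dict[str, list[str]], rng: random.Random) -> str:
    """Pick a tier by its probability, then a member uniformly within it."""
    names = [t for t in TIER_WEIGHTS if tiers.get(t)]  # skip empty tiers
    weights = [TIER_WEIGHTS[t] for t in names]
    tier = rng.choices(names, weights=weights, k=1)[0]
    return rng.choice(tiers[tier])

tiers = {
    "very_common": ["James", "Mary"],
    "common": ["Brian", "Karen"],
    "uncommon": ["Thaddeus"],
    "rare": ["Xiomara"],
}
rng = random.Random(23)
sample = [weighted_pick(tiers, rng) for _ in range(1000)]
# Popular names dominate; "Xiomara" (rare tier) surfaces only occasionally
```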
Weighted sampling combines seamlessly with multi-country mixing. Each country’s batch uses its own tiered data independently, so a mixed dataset will have weighted US names alongside weighted German names:
All 71 supported country locales have tiered name and location data, so weighted=True produces realistic frequency distributions for every country.
Output Formats
The generate_dataset() function supports multiple output formats via the output= parameter, making it easy to integrate with your preferred data processing library.
Both formats work seamlessly with Pointblank’s validation functions, so you can choose whichever fits best with your existing data pipeline.
Using Generated Data for Validation Testing
A common use case is generating test data to validate your validation rules:
```python
# Define a schema with constraints
schema = pb.Schema(
    user_id=pb.int_field(min_val=1, unique=True),
    email=pb.string_field(preset="email"),
    age=pb.int_field(min_val=18, max_val=100),
    status=pb.string_field(allowed=["active", "pending", "inactive"]),
)

# Generate test data
test_data = pb.generate_dataset(schema, n=100, seed=23)

# Validate the generated data (it should pass all checks)
validation = (
    pb.Validate(test_data)
    .col_vals_gt("user_id", 0)
    .col_vals_regex("email", r".+@.+\..+")
    .col_vals_between("age", 18, 100)
    .col_vals_in_set("status", ["active", "pending", "inactive"])
    .interrogate()
)

validation
```
```
Pointblank Validation
2026-02-19 | 04:59:06                                                    Polars

STEP  STEP TYPE           COLUMNS  VALUES                     EVAL  UNITS  PASS      FAIL
1     col_vals_gt()       user_id  0                          ✓     100    100 1.00  0 0.00
2     col_vals_regex()    email    .+@.+\..+                  ✓     100    100 1.00  0 0.00
3     col_vals_between()  age      [18, 100]                  ✓     100    100 1.00  0 0.00
4     col_vals_in_set()   status   active, pending, inactive  ✓     100    100 1.00  0 0.00
```
Since the generated data respects the constraints defined in the schema, it should pass all validation checks. This workflow is particularly useful for testing validation logic before applying it to production data, or for creating reproducible test fixtures in your CI/CD pipeline.
Pytest Fixture
When Pointblank is installed, a generate_dataset pytest fixture is automatically available in all your test files. There is no need to import anything or add configuration to conftest.py: the fixture is registered via pytest’s plugin system.
The fixture works identically to pb.generate_dataset(), but with one key difference: when you don’t supply a seed= parameter, a deterministic seed is automatically derived from the test’s fully-qualified name. This means:
the same test always produces the same data: no manual seed management required.
different tests get different seeds, so they exercise different datasets.
you can still pass an explicit seed= to override the automatic seed when needed.
Basic Usage
Use it by adding generate_dataset to your test function’s parameter list:
Calling the fixture multiple times within the same test produces different (but still deterministic) data on each call:
```python
def test_merge_pipeline(generate_dataset):
    customers = generate_dataset(customer_schema, n=1000, country="US")
    orders = generate_dataset(order_schema, n=5000)
    # Each call gets a unique seed derived from the test name + call index,
    # so both DataFrames are deterministic and different from each other.
    result = merge_pipeline(customers, orders)
    assert result.shape[0] > 0
```
Testing Across Locales
The fixture makes locale testing particularly concise when combined with pytest.mark.parametrize:
You can also use .default_seed to reproduce the exact dataset outside of pytest:
```python
# In a REPL or notebook, reproduce the data from a failed test:
import pointblank as pb

df = pb.generate_dataset(schema, n=500, seed=<default_seed_from_output>)
```
Seed Stability
A given seed (whether explicit or auto-derived) is guaranteed to produce identical output within the same Pointblank version. Across versions, changes to country data files or generator logic may alter the output for a given seed.
For CI pipelines that require bit-exact data across library upgrades, we recommend saving generated DataFrames as Parquet or CSV snapshot files rather than relying on cross-version seed stability. This is the same approach used by snapshot-testing tools like pytest-snapshot and syrupy.
Conclusion
Test data generation provides a convenient way to create realistic synthetic datasets directly from schema definitions. While the concept is straightforward (defining field types and constraints, then generating matching data), the feature can be invaluable in many development and testing workflows. By incorporating test data generation into your process, you can:
quickly prototype validation rules before working with production data
create reproducible test fixtures for automated testing and CI/CD pipelines
generate locale-specific data for internationalization testing across 71 countries
ensure coherent relationships between related fields like names, emails, addresses, jobs, and license plates
produce datasets of any size with consistent, realistic values
Whether you’re building validation logic, testing data pipelines, or simply need sample data for development, the schema-based generation approach gives you precise control over data characteristics while maintaining the realism needed to uncover edge cases and validate your assumptions about data quality.