# SQLite database becomes corrupt

I created a single table SQLite db from a csv file using python. I was able to open, browse and query the database using DB Browser for SQLite (browser) immediately after creating it. However, when I try to open the same database a week or two later, having done nothing with it in the intervening time, it causes the browser to hang. In particular, it does appear to open correctly and show the table structure, but hangs when I try to browse or run a query. I suspect the problem is that the files are highly fragmented when they are first created. Later they are defragmented by the os, and that’s how they break. I’ve used sysinternals contig.exe to see how many fragments each file has. The old “corrupt” version has 2 fragments, but the newly created version has 18.
Could defragmentation be the problem?
Acting on Mark Stewart’s answer, I tried connecting to both the recently created db and the one created several weeks ago from the command line. Both exhibited the same behavior as when I opened them in the browser. Here is the session from the command line:

D:CoursespmMachLearnsqlAlchemy>sqlite3 csv2db_loop.db
SQLite version 3.24.0 2018-06-04 19:24:41
Enter ".help" for usage hints.
sqlite> .tables
complaints
sqlite> select count(*) from complaints;
1036460
sqlite> .exit

SQLite version 3.24.0 2018-06-04 19:24:41
Enter ".help" for usage hints.
sqlite> .tables
complaints
sqlite> select count(*) from complaints;


At this point, sqlite3 hung, and I waited several minutes before finally exiting with Ctrl-C. I then used fc to compare the two db files, and tried connecting again to the old file. The continuation of the session above, after Ctrl-C:

D:CoursespmMachLearnsqlAlchemy>fc csv2db_loop.db csv2db_loop_BAD2.db
FC: no differences encountered

SQLite version 3.24.0 2018-06-04 19:24:41
Enter ".help" for usage hints.
sqlite> .tables
complaints
sqlite> select count(*) from complaints;
1036460
sqlite>


Whatever was causing sqlite3 to hang was apparently cured by performing the file comparison. The old file also works correctly now in the browser.
Before trying the file comparison, I had made a copy of the working file, as an experiment to see if a less fragmented copy would exhibit the same behavior as the old versions. It did have fewer fragments– just 1 versus 18 for the file I copied from. However, it did not hang like the older versions. That does seem to indicate that defragmenting is not the cause of the problem. I thought it might have something to do with the file not being in memory. However, I also tried copying an old non-working file, and neither the old file nor the copy worked. After I compared them with each other, though, they both worked.
To summarize the results of my experiments:

1. db files behave correctly immediately after being created
2. files behave incorrectly after not being accessed for several weeks
3. a copy of a file that behaves correctly also behaves correctly
4. a copy of a file that does not behave correctly also does not behave correctly
5. comparing two identical files makes them behave correctly, regardless of whether either or both behaved correctly before being compared

Regarding LC’s answer, You were right, at least about the query just not being finished. I had let it run for several minutes before aborting with ctrl-C, but based on your assessment, I connected to another “stale” file and just let it run while I went and took a nap. I opened task manager while it was running and saw that it was using the cpu and it was doing a lot of disk transfers. I don’t know how long it ran, but it had completed when I came back from a nap and watering some plants– less than 2 hours, but more than 10 minutes. I did notice that it was in sorted order by the primary key when it opened in the browser, rather than in the order in which the records were added. I guess the delay may be due to the necessity of sorting 1,000,000+ records, but that still seems a little suspicious given the machine’s 2+ GHz clock rate and assuming something close to n log(n) sorting complexity. However, while the query was running, its cpu usage was only about 1% max, while disk usage was a pretty constant 0.7 MB/s. I also don’t yet understand how the delay manifests in a file copied from a “stale” file, but not in a file copied from one recently created.
I’m pretty sure there’s no fragmentation within the databases themselves. Each was created with a single table, and no records were ever updated or deleted. My whole point in creating the db in the first place was to compare two different methods in Python for putting a csv data file into a database table. As such, it’s somewhat disconcerting to find that a single query with an aggregate function can take orders of magnitude more time to execute depending on the memory cache. I’m doing another experiment, to see if creating an index will prevent the excessive delay for a stale file.

None of these files is corrupt. The query appears to hang because it just has not yet finished.

The problem might be not only that the database file is fragmented, but also that the pages in the database are fragmented (this typically happens after many deletions/insertions). When SQLite tries to read all rows in order, it reads the pages in a random order, which prevents the OS from prefetching data from disk.

To defragment the database, run VACUUM.

Anyway, the most likely reason is that after two weeks, the database file is no longer in the file cache. After running fc, the query runs faster because the entire file already is in the OS file cache.

Answered by CL. on December 22, 2020

## Related Questions

### Automatically purging binary logs

6  Asked on January 6, 2022 by kriegu

### What is the difference between sys.dm_hadr_ tables and sys.availability_groups / sys.availability_replicas?

1  Asked on January 6, 2022

### MongoDB 4.2 using change streams with read concern majority disabled in a PSA deployment

1  Asked on January 6, 2022 by franz-huebner

### Query customers attributes

1  Asked on January 6, 2022 by mac-portter

### MongoDB different members of cluster choosing different index

1  Asked on January 4, 2022 by marc-esher

### How to search for compound words, and get the word parts returned

1  Asked on January 4, 2022 by lares-dk

### What is the CURRENT_DATE or current date value function for SQL Server?

2  Asked on January 4, 2022

### macOS Catalina – postgreSQL – sysctl.conf still relevant?

2  Asked on January 4, 2022 by crashmeister

### How to deal with a person that may have multiple alias names?

2  Asked on January 2, 2022 by javascriptloser

### How to make delete duplicates faster?

1  Asked on January 2, 2022

### Function that loops through array parameter values to build multiple WHERE clauses (postgres 11.4)

1  Asked on January 2, 2022 by james-randall

### Adding an index without a lag (PostgreSQL)

1  Asked on January 2, 2022

### Leaving out foreign key when creating a postgresql table

0  Asked on January 2, 2022 by gandalf

### Service Broker locks

1  Asked on January 2, 2022

### How to find the most common strings between rows

1  Asked on January 2, 2022 by atmdev

### How can I find a view column’s base table column?

3  Asked on December 31, 2021 by durette

### MySQL/innodb trying to lock uncommitted row from parallel transaction, deadlock in result

1  Asked on December 31, 2021 by triffids

### Why do I keep getting ORA-01033: ORACLE initialization or shutdown in progress when my database is open? (12c)

1  Asked on December 31, 2021 by j-brune

### Searching large dataset, best way to speed this up

4  Asked on December 31, 2021