QUESTION

I have a tab-separated value file with 10 million rows, each of which has three tab-separated values. The first value is a string, the second an integer, and the third another string. How can I efficiently (in terms of time and memory footprint) read the $n^{th}$ to $(n+100)^{th}$ rows of the file into Mathematica as

{
    {_String, _Integer, _String},
    ...
}

?

{ asked by Problemaniac }

ANSWER

For a one-off read you can Skip the first n - 1 records and then read the next 100:

str = OpenRead["test.tsv"];
Skip[str, Record, n - 1];
data = ReadList[str, {Record, Number, Record}, 100, RecordSeparators -> {"\t", "\n"}];
Close[str];
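
The same steps can be wrapped in a function so the stream is always closed after the read. This is only a sketch: readRows is a hypothetical name, not something from the original answer, and it assumes the file contains no empty lines.

(* hypothetical helper: read count rows starting at row n (1-based) *)
readRows[file_String, n_Integer?Positive, count_Integer?Positive] := Module[{s, data},
  s = OpenRead[file];
  If[n > 1, Skip[s, Record, n - 1]]; (* skip the rows before row n *)
  data = ReadList[s, {Record, Number, Record}, count, RecordSeparators -> {"\t", "\n"}];
  Close[s];
  data]

data = readRows["test.tsv", n, 101] (* rows n through n + 100 inclusive *)

Since the question asks for rows n through n + 100 inclusive, the count passed here is 101. A count of 100 gives rows n through n + 99.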

If you will be reading from the same file many times, it may be worth building an index of stream positions, one per row, that you can use with SetStreamPosition:

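str = OpenRead["test.tsv"];
(* record the stream position just before each of the first 100000 rows is skipped *)
index = Table[pos = StreamPosition[str]; Skip[str, Record]; pos, {100000}];

(* jump to row n and read m rows; str must stay open between calls *)
readlines[n_, m_] := (
  SetStreamPosition[str, index[[n]]];
  ReadList[str, {Record, Number, Record}, m, RecordSeparators -> {"\t", "\n"}])

data = readlines[50000, 100]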

On my PC, building the index took about half a second for 10^5 rows of the file. Assuming it scales linearly, that would be about 50 seconds (roughly a minute) for 10^7 rows, so this is only worth doing if you are going to do a lot of reads.
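
The index above covers a fixed 100000 rows. A minimal sketch that indexes every row by reading until the end of the file might look like the following. buildIndex is a hypothetical helper name, not part of the original answer. It uses Read rather than Skip because Read returns EndOfFile when nothing is left, so it may well be slower than the timing quoted above.

(* hypothetical helper: collect the stream position before every row in the file *)
buildIndex[s_InputStream] := Module[{pos},
  SetStreamPosition[s, 0];
  Flatten[Last[Reap[
    While[pos = StreamPosition[s]; Read[s, Record] =!= EndOfFile, Sow[pos]]]]]]

str = OpenRead["test.tsv"];
index = buildIndex[str];
Length[index] (* number of rows indexed *)

The resulting list can be used in place of the fixed-size index. Call Close[str] once all reads are done.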

{ answered by Simon Woods }