Send Moses-support mailing list submissions to
moses-support@mit.edu
To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu
You can reach the person managing the list at
moses-support-owner@mit.edu
When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."
Today's Topics:
1. Re: How to use XML-Input in Moses? (Philipp Koehn)
2. Re: 12-gram language model ARPA file for 16GB (liling tan)
3. Re: 12-gram language model ARPA file for 16GB
(Marcin Junczys-Dowmunt)
----------------------------------------------------------------------
Message: 1
Date: Mon, 4 May 2015 13:46:43 -0400
From: Philipp Koehn <phi@jhu.edu>
Subject: Re: [Moses-support] How to use XML-Input in Moses?
To: liling tan <alvations@gmail.com>
Cc: moses-support <moses-support@mit.edu>
Message-ID:
<CAAFADDB6d=UNRYzc8sKy-y1QcyUy34T9o8RBFWjU2YY-mrEzhw@mail.gmail.com>
Content-Type: text/plain; charset=UTF-8
Hi,
there is no tool distributed with Moses to do that.
-phi
On Sat, May 2, 2015 at 6:09 PM, liling tan <alvations@gmail.com> wrote:
> Dear Moses devs/users,
>
> I want to use the XML-Input to add constraints when decoding
> (http://www.statmt.org/moses/?n=Advanced.Hybrid#ntoc7)
>
> The example on the Moses page shows only a single XML input. I
> have 700,000 entries in a dictionary that I search-and-replace with a
> Python script to rewrite the decoder's input file. This is rather slow when
> I'm decoding a huge file: for each sentence I have to search through all
> 700,000 terms in the dictionary and do a regex replace.
>
> Is there a canonical way to add a dictionary for XML-input in Moses?
>
> Is there a page that someone can point me to for that?
>
> Regards,
> Liling
>
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support
>
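There is no bundled tool, but for the dictionary-markup workflow in the quoted question, one common speed-up is to compile the whole glossary into a single regex alternation, so each sentence is scanned once rather than once per term. A minimal sketch, with a made-up two-entry glossary and the `translation`-attribute markup from the linked XML-input page (the tag name `np` and all entries here are illustrative, not from the thread):

```python
import re

# Hypothetical glossary: source term -> forced translation.
glossary = {
    "magnetic resonance imaging": "IRM",
    "neural network": "reseau de neurones",
}

# One alternation over all terms, longest first, so overlapping
# entries prefer the longest match.
pattern = re.compile(
    "|".join(re.escape(t) for t in sorted(glossary, key=len, reverse=True))
)

def markup(sentence):
    """Wrap every glossary hit in Moses XML-input markup."""
    return pattern.sub(
        lambda m: '<np translation="%s">%s</np>' % (glossary[m.group(0)], m.group(0)),
        sentence,
    )

print(markup("a neural network for magnetic resonance imaging"))
```

For very large glossaries an Aho-Corasick automaton (e.g. the `pyahocorasick` package) scales better than a regex alternation, since match time stays independent of the number of dictionary entries.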
------------------------------
Message: 2
Date: Mon, 4 May 2015 20:50:07 +0200
From: liling tan <alvations@gmail.com>
Subject: Re: [Moses-support] 12-gram language model ARPA file for 16GB
To: moses-support <moses-support@mit.edu>
Message-ID:
<CAKzPaJ+YETKc8AB21+juqgAbObNffpkJ+C1Map=X_g9wRZYy-A@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"
Dear Moses devs/users,
Does anyone have an estimate of the disk space needed for a 12-gram language
model trained on 16GB of text? Has anyone tried something similar?
I've tried to clear space on my HDD, but even 420GB is not enough to store
the 12-gram language model.
*STDERR:*
=== 1/5 Counting and sorting n-grams ===
Reading /media/2tb/wmt15/corpus.truecase/train-lm.en
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
tcmalloc: large alloc 7846035456 bytes == 0x10f8000 @
tcmalloc: large alloc 73229664256 bytes == 0x1d5436000 @
****************************************************************************************************
Unigram tokens 3038737446 types 5924314
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:71091768 2:804524736 3:1508483968 4:2413574144 5:3519795968
6:4827148288 7:6335632384 8:8045247488 9:9955993600 10:12067871744
11:14380880896 12:16895020032
tcmalloc: large alloc 16895025152 bytes == 0x1d5436000 @
tcmalloc: large alloc 2413576192 bytes == 0x8f2a4000 @
tcmalloc: large alloc 3519799296 bytes == 0x5c4490000 @
tcmalloc: large alloc 4827152384 bytes == 0x69614e000 @
tcmalloc: large alloc 6335635456 bytes == 0x7b5cd6000 @
tcmalloc: large alloc 8045248512 bytes == 0x92f6f8000 @
tcmalloc: large alloc 9955999744 bytes == 0xb0ef84000 @
tcmalloc: large alloc 12067872768 bytes == 0xd6064c000 @
tcmalloc: large alloc 14380883968 bytes == 0x12f6176000 @
Last input should have been poison.Last input should have been poison.
util/file.cc:196 in void util::WriteOrThrow(int, const void*, std::size_t)
threw FDException because `ret < 1'.
No space left on device in /tmp/lm1BNUOA (deleted) while writing 1759557632
bytes
Last input should have been poison.
util/file.cc:196 in void util::WriteOrThrow(int, const void*, std::size_t)
threw FDException because `ret < 1'.
No space left on device in /tmp/lmjYruEU (deleted) while writing 339929828
bytes
Regards,
Liling
On Sun, May 3, 2015 at 7:44 PM, liling tan <alvations@gmail.com> wrote:
> Dear Moses devs/users,
>
> For now, I only know that it takes more than 250GB. I had 250GB of free
> space and KenLM got "poisoned" by insufficient space...
>
> Does anyone have an idea how big a 12-gram language model ARPA file
> trained on 16GB of text would become?
>
> STDERR:
>
> === 1/5 Counting and sorting n-grams ===
> Reading /media/2tb/wmt15/corpus.truecase/train-lm.en
>
> ----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
> tcmalloc: large alloc 7846035456 bytes == 0x10f4000 @
> tcmalloc: large alloc 73229664256 bytes == 0x1d542e000 @
>
> ****************************************************************************************************
> Unigram tokens 3038737446 types 5924314
> === 2/5 Calculating and sorting adjusted counts ===
> Chain sizes: 1:71091768 2:804524736 3:1508483968 4:2413574144 5:3519795968
> 6:4827148288 7:6335632384 8:8045247488 9:9955993600 10:12067871744
> 11:14380880896 12:16895020032
> tcmalloc: large alloc 16895025152 bytes == 0x1d542e000 @
> tcmalloc: large alloc 2413576192 bytes == 0x8f2a0000 @
> tcmalloc: large alloc 3519799296 bytes == 0x5c4488000 @
> tcmalloc: large alloc 4827152384 bytes == 0x696146000 @
> tcmalloc: large alloc 6335635456 bytes == 0x7b5cce000 @
> tcmalloc: large alloc 8045248512 bytes == 0x92f6f0000 @
> tcmalloc: large alloc 9955999744 bytes == 0xb0ef7c000 @
> tcmalloc: large alloc 12067872768 bytes == 0xd60644000 @
> tcmalloc: large alloc 14380883968 bytes == 0x12f616e000 @
> Last input should have been poison.
> Last input should have been poison.util/file.cc:196 in void
> util::WriteOrThrow(int, const void*, std::size_t) threw FDException because
> `ret < 1'.
> No space left on device in /tmp/PC2o3z (deleted) while writing 5301120368
> bytes
>
> Last input should have been poison.util/file.cc:196 in void
> util::WriteOrThrow(int, const void*, std::size_t) threw FDException because
> `ret < 1'.
> No space left on device in /tmp/PftXeo (deleted) while writing 1941075872
> bytesLast input should have been poison.
>
> util/file.cc:196 in void util::WriteOrThrow(int, const void*, std::size_t)
> threw FDException because `ret < 1'.
> No space left on device in /tmp/CuZcPM (deleted) while writing 2984722272
> bytes
>
> util/file.cc:196 in void util::WriteOrThrow(int, const void*, std::size_t)
> threw FDException because `ret < 1'.
> No space left on device in /tmp/F2bE8A (deleted) while writing 389439488
> bytes
>
> Regards,
> Liling
>
>
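As a rough gauge of the working set, the per-order buffer sizes that lmplz reports in the "Chain sizes:" line of the STDERR above can be summed; for this run they total about 81 GB, and the "No space left on device" errors show the on-disk sorting scratch in /tmp growing well beyond even that. A small sketch that parses the line copied verbatim from the log:

```python
import re

# "Chain sizes" line copied verbatim from the lmplz STDERR above.
line = ("Chain sizes: 1:71091768 2:804524736 3:1508483968 4:2413574144 "
        "5:3519795968 6:4827148288 7:6335632384 8:8045247488 9:9955993600 "
        "10:12067871744 11:14380880896 12:16895020032")

# Map n-gram order -> buffer size in bytes.
sizes = {int(order): int(nbytes) for order, nbytes in re.findall(r"(\d+):(\d+)", line)}
total = sum(sizes.values())
print("orders: %d, total chain buffers: %.1f GB" % (len(sizes), total / 1e9))
# → orders: 12, total chain buffers: 80.8 GB
```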
------------------------------
Message: 3
Date: Mon, 04 May 2015 20:57:56 +0200
From: Marcin Junczys-Dowmunt <junczys@amu.edu.pl>
Subject: Re: [Moses-support] 12-gram language model ARPA file for 16GB
To: moses-support@mit.edu
Message-ID: <5547C134.4090309@amu.edu.pl>
Content-Type: text/plain; charset="windows-1252"
Hi,
I've been happily building 9-gram models, though class-based ones, from
some 80 GB of gzipped text. I think lmplz used up to 2.5 TB of disk for that.
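Two things that usually help at this scale (a hedged sketch with made-up paths; `-o`, `-S`, `-T`, and `--prune` are standard lmplz options): point the temporary sorting files at a disk with room, since the errors above show them going to /tmp, and prune rare higher-order n-grams, which shrinks both the scratch files and the final ARPA file:

```shell
# Sketch of an lmplz invocation; paths are illustrative.
# -T: put sorting scratch files on the big disk instead of /tmp
# -S: cap memory use at 80% of RAM
# --prune 0 0 1: keep all uni-/bigrams, drop singleton n-grams
#                of order 3 and above
bin/lmplz -o 12 \
    -S 80% \
    -T /media/2tb/lm_tmp \
    --prune 0 0 1 \
    < /media/2tb/wmt15/corpus.truecase/train-lm.en \
    > train-lm.12gram.arpa
```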
On 04.05.2015 at 20:50, liling tan wrote:
> [quoted message trimmed; identical to Message 2 above]
------------------------------
_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support
End of Moses-support Digest, Vol 103, Issue 7
*********************************************